SQL-BASED ANALYTIC ALGORITHMS

CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit under 35 U.S.C. Section 119(e) of the
co-pending and commonly-assigned U.S. provisional patent application Serial No.
60/102,831, filed October 2, 1998, by Timothy E. Miller, Brian D. Tate, James D.
Hildreth, Miriam H. Herman, Todd M. Brye, and James E. Pricer, entitled
Teradata Scalable Discovery, which application is incorporated by reference herein.
This application is also related to the following co-pending and
commonly-assigned utility patent applications:

Application Serial No. --/---,---, filed on same date herewith, by
Brian D. Tate, James E. Pricer, Tej Anand, and Randy G. Kerber, entitled
SQL-Based Analytic Algorithm for Association, attorney's docket number
8219,

Application Serial No. --/---,---, filed on same date herewith, by
James D. Hildreth, entitled SQL-Based Analytic Algorithm for Clustering,
attorney's docket number 8220,

Application Serial No. --/---,---, filed on same date herewith, by
Todd M. Brye, entitled SQL-Based Analytic Algorithm for Rule Induction,
attorney's docket number 8221,

Application Serial No. --/---,---, filed on same date herewith, by
Brian D. Tate, entitled SQL-Based Automated Histogram Bin Data
Derivation Assist,

Application Serial No. --/---,---, filed on same date herewith, by
Brian D. Tate, entitled SQL-Based Automated, Adaptive, Histogram Bin
Data Description Assist, attorney's docket number 8223,

Application Serial No. PCT/US99/------, filed on same date
herewith, by Timothy E. Miller, Brian D. Tate, Miriam H. Herman, Todd
M. Brye, and Anthony L. Rollins, entitled Data Mining Assists in a
Relational Database Management System, attorney's docket number 8224,

Application Serial No. --/---,---, filed on same date herewith, by
Todd M. Brye, Brian D. Tate, and Anthony L. Rollins, entitled SQL-Based
Data Reduction Techniques for Delivering Data to Analytic Tools,
attorney's docket number 8225,

Application Serial No. PCT/US99/------, filed on same date
herewith, by Timothy E. Miller, Miriam H. Herman, and Anthony L.
Rollins, entitled Techniques for Deploying Analytic Models in Parallel,
attorney's docket number 8226, and

Application Serial No. PCT/US99/------, filed on same date
herewith, by Timothy E. Miller, Brian D. Tate, and Anthony L. Rollins,
entitled Analytic Logical Data Model, attorney's docket number 8227,
all of which are incorporated by reference herein.

BACKGROUND OF THE INVENTION
1. Field of the Invention.

This invention relates in general to a relational database management
system, and in particular, to SQL-based analytic algorithms that provide statistical
and machine learning methods to create analytic models from the data residing in a
relational database.



2. Description of Related Art.

Relational databases are the predominant form of database management
systems used in computer systems. Relational database management systems are
often used in so-called "data warehouse" applications where enormous amounts of
data are stored and processed. In recent years, several trends have converged to
create a new class of data warehousing applications known as data mining
applications. Data mining is the process of identifying and interpreting patterns in
databases, and can be generalized into three stages.

Stage one is the reporting stage, which analyzes the data to determine what
happened. Generally, most data warehouse implementations start with a focused
application in a specific functional area of the business. These applications usually
focus on reporting historical snapshots of business information that was previously
difficult or impossible to access. Examples include Sales Revenue Reporting,
Production Reporting and Inventory Reporting, to name a few.

Stage two is the analyzing stage, which analyzes the data to determine why
it happened. As stage one end-users gain previously unseen views of their business,
they quickly seek to understand why certain events occurred; for example, a decline
in sales revenue. After discovering a reported decline in sales, data warehouse users
will then obviously ask, "Why did sales go down?" Learning the answer to this
question typically involves probing the database through an iterative series of ad
hoc or multidimensional queries until the root cause of the condition is discovered.
Examples include Sales Analysis, Inventory Analysis or Production Analysis.

Stage three is the predicting stage, which tries to determine what will
happen. As stage two users become more sophisticated, they begin to extend their
analysis to include prediction of unknown events. For example, "Which end-users
are likely to buy a particular product?", or "Who is at risk of leaving for the
competition?" It is difficult for humans to see or interpret subtle relationships in
data, hence as data warehouse users evolve to sophisticated predictive analysis they
soon reach the limits of traditional query and reporting tools. Data mining helps
end-users break through these limitations by leveraging intelligent software tools to
shift some of the analysis burden from the human to the machine, enabling the
discovery of relationships that were previously unknown.

Many data mining technologies are available, from single algorithm
solutions to complete tool suites. Most of these technologies, however, are used in
a desktop environment where little data is captured and maintained. Therefore,
most data mining tools are used to analyze small data samples, which were gathered
from various sources into proprietary data structures or flat files. On the other
hand, organizations are beginning to amass very large databases and end-users are
asking more complex questions requiring access to these large databases.

Unfortunately, most data mining technologies cannot be used with large
volumes of data. Further, most analytical techniques used in data mining are
algorithmic-based rather than data-driven, and as such, there is currently little
synergy between data mining and data warehouses. Moreover, from a usability
perspective, traditional data mining techniques are too complex for use by database
administrators and application programmers, and are too difficult to change for a
different industry or a different customer.

Thus, there is a need in the art for data mining applications that directly
operate against data warehouses, and that allow non-statisticians to benefit from
advanced mathematical techniques available in a relational environment.



SUMMARY OF THE INVENTION

To overcome the limitations in the prior art described above, and to
overcome other limitations that will become apparent upon reading and
understanding the present specification, the present invention discloses a method,
apparatus, and article of manufacture for performing data mining applications in a
relational database management system. At least one analytic algorithm is
performed by a computer directly against a relational database, wherein the analytic
algorithm includes SQL statements performed by the relational database
management system and optional programmatic iteration, and the analytic
algorithm creates at least one analytic model within an analytic logical data model
from data residing in the relational database.

An object of the present invention is to provide more efficient usage of
parallel processor computer systems. An object of the present invention is to
provide a foundation for data mining tool sets in relational database management
systems. Further, an object of the present invention is to allow data mining of large
databases.

BRIEF DESCRIPTION OF THE DRAWINGS
Referring now to the drawings in which like reference numbers represent
corresponding parts throughout:

FIG. 1 is a block diagram that illustrates an exemplary computer hardware
environment that could be used with the preferred embodiment of the present
invention;
FIG. 2 is a block diagram that illustrates an exemplary logical architecture
that could be used with the preferred embodiment of the present invention; and
FIGS. 3, 4, and 5 are flowcharts that illustrate exemplary logic performed
according to the preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In the following description of the preferred embodiment, reference is made
to the accompanying drawings, which form a part hereof, and in which is shown
by way of illustration a specific embodiment in which the invention may be
practiced. It is to be understood that other embodiments may be utilized and
structural changes may be made without departing from the scope of the present
invention.

OVERVIEW 

The present invention provides a relational database management system
(RDBMS) that supports data mining operations of relational databases. In essence,
advanced analytic processing capabilities for data mining applications are placed
where they belong, i.e., close to the data. Moreover, the results of these analytic
processing capabilities can be made to persist within the database or can be exported
from the database. These analytic processing capabilities and their results are
exposed externally to the RDBMS by an application programmable interface (API).

According to the preferred embodiment, the data mining process is an
iterative approach referred to as a "Knowledge Discovery Analytic Process"
(KDAP). There are six major tasks within the KDAP:

1. Understanding the business objective.
2. Understanding the source data available.
3. Selecting the data set and "pre-processing" the data.
4. Designing the analytic model.
5. Creating and testing the models.
6. Deploying the analytic models.

The present invention provides various components for addressing these tasks:

• An RDBMS that executes Structured Query Language (SQL)
  statements against a relational database.
• An analytic Application Programming Interface (API) that creates
  scalable data mining functions comprised of complex SQL
  statements.
• Application programs that instantiate and parameterize the analytic
  API.
• Analytic algorithms utilizing:
    • Extended ANSI SQL statements,
    • a Call Level Interface (CLI) comprised of SQL statements and
      programmatic iteration, and
    • a Data Reduction Utility Program comprised of SQL
      statements and programmatic iteration.
• An analytical logical data model (LDM) that stores results from and
  information about the advanced analytic processing in the RDBMS.
• A parallel deployer that controls parallel execution of the results of
  the analytic algorithms that are stored in the analytic logical data
  model.
The benefits of the present invention include:

• Data mining of very large databases directly within a relational
  database.
• Management of analytic results within a relational database.
• A comprehensive set of analytic operations that operate within a
  relational database management system.
• Application integration through an object-oriented API.

These components and benefits are described in more detail below.

HARDWARE ENVIRONMENT
FIG. 1 is a block diagram that illustrates an exemplary computer hardware
environment that could be used with the preferred embodiment of the present
invention. In the exemplary computer hardware environment, a massively parallel
processing (MPP) computer system 100 is comprised of one or more processors or
nodes 102 interconnected by a network 104. Each of the nodes 102 is comprised of
one or more processors, random access memory (RAM), read-only memory
(ROM), and other components. It is envisioned that attached to the nodes 102 may
be one or more fixed and/or removable data storage units (DSUs) 106 and one or
more data communications units (DCUs) 108, as is well known in the art.

Each of the nodes 102 executes one or more computer programs, such as a
Data Mining Application (APPL) 110 performing data mining operations,
Advanced Analytic Processing Components (AAPC) 112 for providing advanced
analytic processing capabilities for the data mining operations, and/or a Relational
Database Management System (RDBMS) 114 for managing a relational database 116
stored on one or more of the DSUs 106 for use in the data mining applications,
wherein various operations are performed in the APPL 110, AAPC 112, and/or
RDBMS 114 in response to commands from one or more Clients 118. In
alternative embodiments, the APPL 110 may be executed in one or more of the
Clients 118, or on an application server on a different platform attached to the
network 104.
Generally, the computer programs are tangibly embodied in and/or
retrieved from RAM, ROM, one or more of the DSUs 106, and/or a remote device
coupled to the computer system 100 via one or more of the DCUs 108. The
computer programs comprise instructions which, when read and executed by a
node 102, cause the node 102 to perform the steps necessary to execute the steps or
elements of the present invention.

WO 00/20982                                                    PCT/US99/22966

Those skilled in the art will recognize that the exemplary environment
illustrated in FIG. 1 is not intended to limit the present invention. Indeed, those
skilled in the art will recognize that other alternative hardware environments may
be used without departing from the scope of the present invention. In addition, it
should be understood that the present invention may also apply to other computer
programs than those disclosed herein.

LOGICAL ARCHITECTURE
FIG. 2 is a block diagram that illustrates an exemplary logical architecture of
the AAPC 112, and its interaction with the APPL 110, RDBMS 114, relational
database 116, and Client 118, according to the preferred embodiment of the present
invention. In the preferred embodiment, the AAPC 112 includes the following
components:

• An Analytic Logical Data Model (LDM) 200 that stores results from
  and information about the advanced analytic processing in the
  RDBMS 114,
• One or more Scalable Data Mining Functions 202 that comprise
  complex, optimized SQL statements that perform advanced analytic
  processing in the RDBMS 114,
• An Analytic Application Programming Interface (API) 204 that
  provides a mechanism for an APPL 110 or other component to
  invoke the Scalable Data Mining Functions 202,
• One or more Analytic Algorithms 206 that can operate as standalone
  applications or can be invoked by another component, wherein the
  Analytic Algorithms 206 comprise:
    • Extended ANSI SQL 208 that can be used to implement a
      certain class of Analytic Algorithms 206,
    • A Call Level Interface (CLI) 210 that can be used when a
      combination of SQL and programmatic iteration is required
      to implement a certain class of Analytic Algorithms 206, and
    • A Data Reduction Utility Program 212 that can be used to
      implement a certain class of Analytic Algorithms 206 where
      data is first reduced using SQL followed by programmatic
      iteration.
• An Analytic Algorithm Application Programming Interface (API)
  214 that provides a mechanism for an APPL 110 or other
  components to invoke the Analytic Algorithms 206,
• A Parallel Deployer 216 that controls parallel executions of the
  results of an Analytic Algorithm 206 (sometimes referred to as an
  analytic model) that are stored in the Analytic LDM 200, wherein
  the results of executing the Parallel Deployer 216 are stored in the
  RDBMS 114.
Note that the use of these various components is optional, and thus only
some of the components may be used in any particular configuration.

The preferred embodiment is oriented towards a multi-tier logical
architecture, in which a Client 118 interacts with the various components described
above, which, in turn, interface to the RDBMS 114 to utilize a large central
repository of enterprise data stored in the relational database 116 for analytic
processing.

In one example, a Client 118 interacts with an APPL 110, which interfaces
to the Analytic API 204 to invoke one or more of the Scalable Data Mining
Functions 202, which are executed by the RDBMS 114. The results from the
execution of the Scalable Data Mining Functions 202 would be stored as an analytic
model within an Analytic LDM 200 in the RDBMS 114.

In another example, a Client 118 interacts with one or more Analytic
Algorithms 206 either directly or via the Analytic Algorithm API 214. The
Analytic Algorithms 206 comprise SQL statements that may or may not include
programmatic iteration, and the SQL statements are executed by the RDBMS 114.
In addition, the Analytic Algorithms 206 may or may not interface to the Analytic
API 204 to invoke one or more of the Scalable Data Mining Functions 202, which
are executed by the RDBMS 114. Regardless, the results from the execution of the
Analytic Algorithms 206 would be stored as an analytic model within an Analytic
LDM 200 in the RDBMS 114.

In yet another example, a Client 118 interacts with the Parallel Deployer
216, which invokes parallel instances of the results of the Analytic Algorithms 206,
sometimes referred to as an Analytic Model. The Analytic Model is stored in the
Analytic LDM 200 as a result of executing an instance of the Analytic Algorithms
206. The results of executing the Parallel Deployer 216 are stored in the RDBMS
114.
In still another example, a Client 118 interacts with the APPL 110, which
invokes one or more Analytic Algorithms 206 either directly or via the Analytic
Algorithm API 214. The results would be stored as an analytic model within an
Analytic LDM 200 in the RDBMS 114.
The overall goal is to significantly improve the performance, efficiency, and
scalability of data mining operations by performing compute and/or I/O intensive
operations in the various components. The preferred embodiment achieves this not
only through the parallelism provided by the MPP computer system 100, but also
from reducing the amount of data that flows between the APPL 110, AAPC 112,
RDBMS 114, Client 118, and other components.

Those skilled in the art will recognize that the exemplary configurations
illustrated and discussed in conjunction with FIG. 2 are not intended to limit the
present invention. Indeed, those skilled in the art will recognize that other
alternative configurations may be used without departing from the scope of the
present invention. In addition, it should be understood that the present invention
may also apply to other components than those disclosed herein.
Scalable Data Mining Functions

The Scalable Data Mining Functions 202 comprise complex,
optimized SQL statements that are created, in the preferred embodiment, by
parameterizing and instantiating the corresponding Analytic APIs 204. The
Scalable Data Mining Functions 202 perform much of the advanced analytic
processing for data mining applications, when performed by the RDBMS
114, without having to move data from the relational database 116.
The Scalable Data Mining Functions 202 can be categorized by the
following functions:

• Data Description: The ability to understand and describe the
  available data using statistical techniques. For example, the
  generation of descriptive statistics, frequencies and/or histogram
  bins.
• Data Derivation: The ability to generate new variables
  (transformations) based upon existing detailed data when designing
  an analytic model. For example, the generation of predictive
  variables such as bitmaps, ranges, codes and mathematical functions.
• Data Reduction: The ability to reduce the number of variables
  (columns) or observations (rows) used when designing an analytic
  model. For example, creating Covariance, Correlation, or Sum of
  Squares and Cross-Products (SSCP) Matrices.
• Data Reorganization: The ability to join or denormalize
  pre-processed results into a wide analytic data set.
• Data Sampling/Partitioning: The ability to intelligently request
  different data samples or data partitions. For example, hash data
  partitioning or data sampling.
The principal theme of the Scalable Data Mining Functions 202 is to
facilitate analytic operations within the RDBMS 114, which process data collections
stored in the database 116 and produce results that also are stored in the database
116. Since data mining operations tend to be iterative and exploratory, the database
116 in the preferred embodiment comprises a combined storage and work space
environment. As such, a sequence of data mining operations is viewed as a set of
steps that start with some collection of tables in the database 116, generate a series
of intermediate work tables, and finally produce a result table or view.
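
For illustration only, the sequence of source table, intermediate work table, and result table described above might be sketched as follows. The sketch assumes Python's built-in sqlite3 module as a stand-in for the RDBMS 114; the names histogram_bins, work_range, result_hist, and the sales table are hypothetical and do not appear in this specification.

```python
import sqlite3


def histogram_bins(conn, table, column, bins):
    """Bin `column` of `table` into equal-width bins inside the database.

    Hypothetical helper: work_range and result_hist are illustrative names.
    """
    bins = int(bins)
    cur = conn.cursor()

    # Intermediate work table: the observed range of the column.
    cur.execute("DROP TABLE IF EXISTS work_range")
    cur.execute(
        f"CREATE TEMP TABLE work_range AS "
        f"SELECT MIN({column}) AS lo, MAX({column}) AS hi FROM {table}"
    )

    # Result table: one row per non-empty bin. The two-argument MIN
    # folds the maximum value into the last bin.
    cur.execute("DROP TABLE IF EXISTS result_hist")
    cur.execute(
        f"CREATE TABLE result_hist AS "
        f"SELECT MIN(CAST(({column} - lo) * {bins} / (hi - lo) AS INTEGER), "
        f"{bins} - 1) AS bin, COUNT(*) AS freq "
        f"FROM {table}, work_range GROUP BY bin"
    )
    return cur.execute(
        "SELECT bin, freq FROM result_hist ORDER BY bin"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?)",
        [(1.0,), (2.0,), (3.0,), (4.0,), (10.0,)],
    )
    return histogram_bins(conn, "sales", "amount", 3)
```

Only the small result table leaves the database in this sketch; the detailed rows are read and aggregated where they are stored.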


Analytic Algorithms

The Analytic Algorithms 206 provide statistical and "machine learning"
methods to create Analytic LDMs 200 from the data residing in the relational
database 116. Analytic Algorithms 206 that are completely data driven, such as
association, can be implemented solely in Extended ANSI SQL 208. Analytic
Algorithms 206 that require a combination of SQL and programmatic iteration,
such as induction, can be implemented using the CLI 210. Finally, Analytic
Algorithms 206 that require almost complete programmatic iteration, such as
clustering, can be implemented using a Data Reduction Utility Program 212,
wherein this approach involves data pre-processing that reduces the amount of data
that a non-SQL algorithm can then process.

The Analytic Algorithms 206 significantly improve the performance and
efficiency of data mining operations by providing the technology components to
perform advanced analytic operations directly against the RDBMS 114. In addition,
the Analytic Algorithms 206 leverage the parallelism that exists in the MPP
computer system 100, the RDBMS 114, and the database 116.
The Analytic Algorithms 206 provide data analysts with an unprecedented
option to train and apply "machine learning" analytics against massive amounts of
data in the relational database 116. Prior techniques have failed as their sequential
design is not optimal in an RDBMS 114 environment. Because the Analytic
Algorithms 206 are implemented in Extended ANSI SQL 208, through the CLI
210, and/or by means of the Data Reduction Utility Program 212, they can
therefore leverage the scalability available on the MPP computer system 100. In
addition, taking a data-driven approach to analysis, through the use of complete
Extended ANSI SQL 208, allows people other than highly educated statisticians to
leverage the advanced analytic techniques offered by the Analytic Algorithms 206.

Extended ANSI SQL

As mentioned above, Analytic Algorithms 206 that are completely data
driven, such as affinity analysis, can be implemented solely in Extended ANSI SQL
208. Typically, these algorithms operate against a set of tables in the
relational database 116 that are populated with transaction-level data, the source of
which could be point-of-sale devices, automated teller machines, call centers, the
Internet, etc. The SQL statements used to process this data typically build
relationships between and among data elements in the tables. For example, the
SQL statements used to process data from point-of-sale devices may build
relationships between and among products and pairs of products. Additionally, the
dimension of time can be added in such a way that these relationships can be
analyzed to determine how they change over time. As the implementation is solely
in SQL statements, the design takes advantage of the hardware and software
environment of the preferred embodiment, by decomposing the SQL statements
into a plurality of sort and merge steps that can be executed concurrently in parallel
by the MPP computer system 100.
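
As a hedged illustration of building relationships between pairs of products solely in SQL, the following sketch counts how often two products share a basket. It assumes Python's built-in sqlite3 module in place of the RDBMS 114 and a hypothetical sales(basket_id, product) table in which each product appears at most once per basket; neither the table nor the query text is taken from this specification.

```python
import sqlite3

# Self-join on the basket: every unordered pair of distinct products
# bought together is counted once (a.product < b.product).
PAIR_SQL = """
SELECT a.product AS left_item,
       b.product AS right_item,
       COUNT(*)  AS pair_count,
       COUNT(*) * 1.0 /
           (SELECT COUNT(DISTINCT basket_id) FROM sales) AS support
FROM sales AS a
JOIN sales AS b
  ON a.basket_id = b.basket_id AND a.product < b.product
GROUP BY a.product, b.product
HAVING COUNT(*) >= :min_count
ORDER BY pair_count DESC, left_item, right_item
"""


def frequent_pairs(conn, min_count):
    """Return (left, right, count, support) for pairs seen min_count times."""
    return conn.execute(PAIR_SQL, {"min_count": min_count}).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (basket_id INTEGER, product TEXT)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [
            (1, "bread"), (1, "milk"),
            (2, "bread"), (2, "milk"), (2, "eggs"),
            (3, "milk"), (3, "eggs"),
        ],
    )
    return frequent_pairs(conn, 2)
```

A time dimension could be added under the same assumptions by carrying a period column in the join and the GROUP BY, so that pair counts can be compared across periods.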

Call-Level Interface

As mentioned above, Analytic Algorithms 206 that require a mix of
programmatic iteration along with Extended ANSI SQL statements, such as
inductive inference, can be implemented using the CLI 210. Whereas the SQL
approach is appropriate for business problems that are descriptive in nature,
inference problems are predictive in nature and typically require a training phase
where the APPL 110 "learns" various rules based upon the data description,
followed by testing and application, where the rules are validated and applied
against a new data set. This class of algorithms is compute-intensive and
historically cannot handle large volumes of data because they expect the analyzed
data to be in a specific fixed or variable flat file format.

Most implementations first extract the data from the database 116 to
construct a flat-file and then execute the "train" portion on this resultant file. This
method is slow and limited by the amount of memory available in the computer
system 100. This process can be improved by leveraging the relational database 116
to perform those portions of the analysis, instead of extracting all the data.

When SQL statements and programmatic iteration are used together, the
RDBMS 114 can be leveraged to perform computations and order data within the
relational database 116, and then extract the information using very little memory
in the APPL 110. Additionally, computations, aggregations and/or ordering can be
run in parallel, because of the massively parallel nature of the RDBMS 114.
Data Reduction Utility Program

As mentioned above, for Analytic Algorithms 206 that can operate on a reduced
or scaled data set, such as regression or clustering, the Data Reduction Utility
Program 212 can be used. The problem of creating analytic models from massive
amounts of detailed data has often been addressed by sampling, mainly because
compute intensive algorithms cannot handle large volumes of data. The approach
of the Data Reduction Utility Program 212 is to reduce data through operations
such as matrix calculations or histogram binning, and then use this reduced or
scaled data as input to a non-SQL algorithm. This method intentionally reduces
fine numerical data details by assigning them to ranges or bins, correlating their
values or determining their covariances. The capacity of the preferred embodiment
for creating these data structures from massive amounts of data in parallel gives it a
special opportunity in this area.
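
A minimal sketch of this reduce-then-compute pattern, assuming Python's built-in sqlite3 module in place of the RDBMS 114: a single SQL statement reduces the detailed rows to the sums of squares and cross-products needed for a one-variable regression, and a non-SQL step then solves for the line. The obs(x, y) table and the fit_line function are hypothetical names introduced only for this example.

```python
import sqlite3


def fit_line(conn):
    """Fit y = intercept + slope * x from five aggregates computed in SQL."""
    # Data reduction: any number of rows collapses to one row of sums.
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
    ).fetchone()

    # Non-SQL step: ordinary least squares on the reduced data.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO obs VALUES (?, ?)",
        [(float(x), 2.0 * x + 1.0) for x in range(5)],
    )
    return fit_line(conn)
```

The same idea extends to several variables by computing the full SSCP matrix in SQL and passing it to a matrix solver, which is left out here to keep the sketch short.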

Analytic Logical Data Model

The Analytic LDM 200, which is integrated with the relational database 116
and the RDBMS 114, provides logical entity and attribute definitions for advanced
analytic processing, i.e., the Scalable Data Mining Functions 202 and Analytic
Algorithms 206, performed by the RDBMS 114 directly against the relational
database 116. These logical entity and attribute definitions comprise metadata that
define the characteristics of data stored in the relational database 116, as well as
metadata that determines how the RDBMS 114 performs the advanced analytic
processing. The Analytic LDM 200 also stores processing results from this
advanced analytic processing, which includes both result tables and derived data for
the Scalable Data Mining Functions 202, Analytic Algorithms 206, and the Parallel
Deployer 216. The Analytic LDM 200 is a dynamic model, since the logical entities
and attributes definitions change depending upon parameterization of the advanced
analytic processing, and since the Analytic LDM 200 is updated with the results of
the advanced analytic processing.
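
One way to picture results being stored in, and later applied from, a model table is sketched below. This is an assumption-laden illustration, not the schema of the Analytic LDM 200: the ldm_model_coef and inputs tables, their columns, and the store_model and score functions are all hypothetical, and Python's built-in sqlite3 module stands in for the RDBMS 114.

```python
import sqlite3


def store_model(conn, name, coefficients):
    """Persist a linear model as rows of (model, term, weight)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ldm_model_coef "
        "(model TEXT, term TEXT, weight REAL)"
    )
    conn.execute("DELETE FROM ldm_model_coef WHERE model = ?", (name,))
    conn.executemany(
        "INSERT INTO ldm_model_coef VALUES (?, ?, ?)",
        [(name, term, weight) for term, weight in coefficients.items()],
    )


def score(conn, name):
    """Apply the stored model to inputs(row_id, term, value) with a join."""
    return conn.execute(
        "SELECT i.row_id, SUM(i.value * c.weight) AS score "
        "FROM inputs AS i JOIN ldm_model_coef AS c ON c.term = i.term "
        "WHERE c.model = ? GROUP BY i.row_id ORDER BY i.row_id",
        (name,),
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    store_model(conn, "demo_line", {"intercept": 1.0, "x": 2.0})
    conn.execute("CREATE TABLE inputs (row_id INTEGER, term TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO inputs VALUES (?, ?, ?)",
        [(1, "intercept", 1.0), (1, "x", 3.0),
         (2, "intercept", 1.0), (2, "x", 0.5)],
    )
    return score(conn, "demo_line")
```

Because the model lives in a table, applying it is itself a set-oriented SQL statement, which is the kind of work a parallel database can spread across its nodes.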

Logic of the Preferred Embodiment

Flowcharts which illustrate the logic of the preferred embodiment of the
present invention are provided in FIGS. 3, 4 and 5. Those skilled in the art will
recognize that this logic is provided for illustrative purposes only and that different
logic may be used to accomplish the same results.

Referring to FIG. 3, this flowchart illustrates the logic of the Scalable Data
Mining Functions 202 according to the preferred embodiment of the present
invention.

Block 300 represents the one or more of the Scalable Data Mining Functions
202 being created via the API 204. This may entail, for example, the instantiation
of an object providing the desired function.

Block 302 represents certain parameters being passed to the API 204, in
order to control the operation of the Scalable Data Mining Functions 202.

Block 304 represents the metadata in the Analytic LDM 200 being accessed,
if necessary for the operation of the Scalable Data Mining Function 202.

Block 306 represents the API 204 generating a Scalable Data Mining
Function 202 in the form of a data mining query based on the passed parameters
and optional metadata.

Block 308 represents the Scalable Data Mining Function 202 being passed to
the RDBMS 114 for execution.
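
The flow of Blocks 300 through 308 might be sketched as follows, purely as an illustration. The DescriptiveStatistics class, its fields, and the use of SQLite's table_info pragma as a stand-in for metadata in the Analytic LDM 200 are assumptions made for this example; Python's built-in sqlite3 module takes the place of the RDBMS 114.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class DescriptiveStatistics:
    """Hypothetical function object; instantiation mirrors Block 300."""

    table: str   # parameters passed in, as in Block 302
    column: str

    def generate_sql(self, conn):
        # Block 304 analogue: consult metadata before generating SQL.
        columns = {row[1] for row in conn.execute(
            f"PRAGMA table_info({self.table})")}
        if self.column not in columns:
            raise ValueError(f"{self.table} has no column {self.column}")
        # Block 306 analogue: produce the data mining query as SQL text.
        c, t = self.column, self.table
        return (
            f"SELECT COUNT({c}) AS n, MIN({c}) AS minimum, "
            f"MAX({c}) AS maximum, AVG({c}) AS mean FROM {t}"
        )

    def run(self, conn):
        # Block 308 analogue: hand the generated query to the database.
        return conn.execute(self.generate_sql(conn)).fetchone()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (balance REAL)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?)",
        [(1.0,), (2.0,), (3.0,), (6.0,)],
    )
    return DescriptiveStatistics("accounts", "balance").run(conn)
```

Keeping SQL generation separate from execution lets a caller inspect or log the generated query before it is run.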

Referring to FIG. 4, this flowchart illustrates the logic of the Analytic
Algorithms 206 according to the preferred embodiment of the present invention.

Block 400 represents the Analytic Algorithms 206 being invoked, either
directly or via the Analytic Algorithm API 214.
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Block 402 represents certain parameters being passed to the Arabic 
Algorithms 206, in order to control their operation. ^ , 

Block 404 represents the metadata in the Analytic LDM 200 being accessed,
if necessary for the operation of the Analytic Algorithms 206.

Block 406 represents the Analytic Algorithms 206 passing SQL statements
to the RDBMS 114 for execution and Block 408 optionally represents the Analytic
Algorithms 206 performing programmatic iteration. Those skilled in the art will
recognize that the sequence of these steps may differ from those described above,
may not include both steps, may include additional steps, and may include
iterations of these steps.

Block 410 represents the Analytic Algorithms 206 storing results in the
Analytic LDM 200.

Referring to FIG. 5, this flowchart illustrates the logic performed by the
RDBMS 114 according to the preferred embodiment of the present invention.

Block 500 represents the RDBMS 114 receiving a query or other SQL
statements.

Block 502 represents the RDBMS 114 analyzing the query. 

Block 504 represents the RDBMS 114 generating a plan that enables the 
RDBMS 114 to retrieve the correct information from the relational database 116 to 
satisfy the query. 

Block 506 represents the RDBMS 114 compiling the plan into object code 
for more efficient execution by the RDBMS 114, although it could be interpreted 
rather than compiled. 

Block 508 represents the RDBMS 114 initiating execution of the plan. 

Block 510 represents the RDBMS 114 generating results from the execution 
of the plan. 

Block 512 represents the RDBMS 114 either storing the results in the 
Analytic LDM 200, or returning the results to the Analytic Algorithm 206, APPL 
110, and/or Client 118. 
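
By way of illustration only, the steps of FIG. 5 can be observed from the outside with a small Python sketch in which SQLite stands in for the RDBMS 114. SQLite's EXPLAIN QUERY PLAN statement reports the plan the engine has chosen, and SQLite compiles each statement into bytecode for its own virtual machine, which is only loosely analogous to the compilation of Block 506. The function names, the table names "sales" and "ldm_region_totals", and the sample data are assumptions of this sketch.

```python
import sqlite3


def run_query(conn, sql, store_as=None):
    """Report the plan for a SELECT, then either store or return its rows."""
    # Blocks 502-504: ask the engine how it intends to satisfy the query.
    # The human-readable detail is the last column of each plan row.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    if store_as is not None:
        if not store_as.isidentifier():
            raise ValueError(f"unsafe table name: {store_as!r}")
        # Block 512, first branch: keep the results inside the database.
        conn.execute(f"CREATE TABLE {store_as} AS {sql}")
        return plan, None
    # Blocks 508-512, second branch: execute and return rows to the caller.
    return plan, conn.execute(sql).fetchall()


def demo_fig5():
    """Run one hypothetical aggregate query both ways and compare the results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.execute("CREATE INDEX sales_region ON sales (region)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 5.0), ("west", 7.0), ("east", 3.0)],
    )
    sql = (
        "SELECT region, SUM(amount) AS total "
        "FROM sales GROUP BY region ORDER BY region"
    )
    plan, rows = run_query(conn, sql)
    run_query(conn, sql, store_as="ldm_region_totals")
    stored = conn.execute(
        "SELECT region, total FROM ldm_region_totals ORDER BY region"
    ).fetchall()
    return plan, rows, stored
```

The exact wording of the reported plan depends on the SQLite version, so the sketch treats it as informational text only.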

CONCLUSION 
This concludes the description of the preferred embodiment of the 
invention. The following describes an alternative embodiment for accomplishing 
the same invention. Specifically, in an alternative embodiment, any type of
computer, such as a mainframe, minicomputer, or personal computer, could be
used to implement the present invention.

In summary, the present invention discloses a method, apparatus, and article
of manufacture for performing data mining applications in a relational database
management system. At least one analytic algorithm is performed by a computer
directly against a relational database, wherein the analytic algorithm includes SQL
statements performed by the relational database management system and optional
programmatic iteration, and the analytic algorithm creates at least one analytic
model within an analytic logical data model from data residing in the relational
database.

The foregoing description of the preferred embodiment of the invention has
been presented for the purposes of illustration and description. It is not intended to
be exhaustive or to limit the invention to the precise form disclosed. Many
modifications and variations are possible in light of the above teaching. It is
intended that the scope of the invention be limited not by this detailed description,
but rather by the claims appended hereto.

WHAT IS CLAIMED IS:

1. A computer-implemented system for performing data mining
applications, comprising:

(a) a computer having one or more data storage devices connected thereto;

(b) a relational database management system, executed by the computer, for
managing a relational database stored on the data storage devices; and

(c) at least one analytic algorithm performed by the computer, wherein the
analytic algorithm includes SQL statements performed by the relational database
management system directly against the relational database and optional
programmatic iteration, and the analytic algorithm creates at least one analytic
model within an analytic logical data model from data residing in the relational
database.

2. The computer-implemented system of claim 1, wherein the analytic
algorithm provides statistical and machine learning methods for creating the
analytic logical data model.



3. The computer-implemented system of claim 1, wherein the analytic
algorithm is implemented in Extended ANSI SQL.

4. The computer-implemented system of claim 3, wherein the analytic
algorithm operates against a set of tables in the relational database, and the
Extended ANSI SQL builds relationships among data elements in the tables.

5. The computer-implemented system of claim 4, wherein the Extended
ANSI SQL analyzes the relationships to determine how the relationships change.

6. The computer-implemented system of claim 1, wherein the analytic
algorithm is implemented in a Call Level Interface (CLI) that processes data from
the relational database using SQL and programmatic iteration.

7. The computer-implemented system of claim 6, wherein the CLI is
used with SQL to perform computations, aggregations, and/or ordering on the data
from the relational database.

8. The computer-implemented system of claim 1, wherein the analytic
algorithm is implemented by a Data Reduction Utility Program that reduces data
from the relational database in bulk using SQL followed by a non-SQL iterative
program.



9. The computer-implemented system of claim 8, wherein the Data
Reduction Utility Program provides a sequence of Extended ANSI SQL followed
by programmatic iteration.



10. A method for performing data mining applications, comprising:

(a) managing a relational database stored on one or more data storage devices
connected to a computer; and

(b) performing at least one analytic algorithm in the computer, wherein the
analytic algorithm includes SQL statements performed by a relational database
management system directly against the relational database and optional
programmatic iteration, and the analytic algorithm creates at least one analytic
model within an analytic logical data model from data residing in the relational
database.



11. An article of manufacture comprising logic embodying a method for
performing data mining applications, comprising:

(a) managing a relational database stored on one or more data storage devices
connected to a computer; and

(b) performing at least one analytic algorithm in the computer, wherein the
analytic algorithm includes SQL statements performed by a relational database
management system directly against the relational database and optional
programmatic iteration, and the analytic algorithm creates at least one analytic
model within an analytic logical data model from data residing in the relational